Patent abstract:
  An apparatus for generating a bandwidth-enhanced audio signal from an input audio signal (50) which has an input audio signal frequency range, comprising: a raw signal generator (10) configured to generate a raw signal (60) that has an enhancement frequency range, where the enhancement frequency range is not included in the input audio signal frequency range; a neural network processor (30) configured to generate a parametric representation (70) for the enhancement frequency range using the input audio signal frequency range of the input audio signal and a trained neural network (31); and a raw signal processor (20) for processing the raw signal (60) using the parametric representation (70) for the enhancement frequency range to obtain a processed raw signal (80) that has frequency components in the enhancement frequency range, wherein the processed raw signal (80), or the processed raw signal and the input audio signal frequency range of the input audio signal, represent the bandwidth-enhanced audio signal.
Publication number: BR112020008216A2
Application number: R112020008216-3
Filing date: 2018-04-13
Publication date: 2020-10-27
Inventors: Konstantin Schmidt; Christian Uhle; Bernd Edler
Applicant: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
IPC main class:
Patent description:

[001] The present invention relates to audio processing and, in particular, to bandwidth enhancement technologies for audio signals, such as bandwidth extension or intelligent gap filling.
[002] The codec most widely used today for mobile speech communication is still AMR-NB, which encodes only frequencies from 200 to 3400 Hz (usually called narrowband (NB)). The human speech signal, however, has a much larger bandwidth; fricatives in particular often have most of their energy above 4 kHz. Limiting the speech frequency range not only makes the speech sound less pleasant, it also makes it less intelligible [1, 2].
[003] State-of-the-art audio codecs such as EVS [3] can encode a much larger frequency range of the signal, but deploying these codecs requires a change in the entire communication network, including the receiving devices. This is a huge effort and has been known to take several years. Blind bandwidth extensions (BBWE, also known as artificial bandwidth extensions or blind bandwidth expansions) are able to extend the frequency range of a signal without the need for additional bits. They are applied to the decoded signal only and do not need any adaptation of the network or of the sending device. Although this is an appealing solution to the limited-bandwidth problem of narrowband codecs, many such systems fail to improve the quality of speech signals. In a joint assessment of recent bandwidth extensions, only four out of 12 systems managed to significantly improve the perceived quality for all languages tested [4].
[004] Following the source-filter model of speech production, most bandwidth extensions (blind or non-blind) have two main building blocks: generating an excitation signal and estimating the vocal tract shape. This is also the approach that the presented system follows. Techniques commonly used to generate the excitation signal are spectral folding, translation or non-linear processing. The vocal tract shape can be estimated by Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), neural networks or deep neural networks (DNN). These models predict the vocal tract shape from features computed on the speech signal.
[005] In [5] and [6], the excitation signal is generated by spectral folding and the vocal tract filter is realized as an all-pole filter in the time domain driven by an HMM. First, a codebook of linear prediction coefficients (LPC) calculated on frames containing the upper-band speech signal is created by vector quantization. On the decoder side, features are calculated on the decoded speech signal and an HMM is used to model the conditional probability of a codebook entry given the features. The final envelope is the weighted sum of all codebook entries, with the probabilities being the weights. In [6], fricative sounds are further emphasized by a neural network.
[006] In [7], the excitation signal is also generated by spectral folding and the vocal tract is modeled by a neural network that outputs gains applied to the folded signal in a Mel filter bank domain.
[007] In [8], a DNN is used to predict the spectral envelope of a spectrally folded excitation signal (whose phase is imaged as well). The system in [9] also uses a spectrally folded excitation signal and models the envelope with a DNN comprising LSTM layers. Because they use multiple audio frames as input to the DNN, these two systems have an algorithmic delay that is too high for real-time telecommunication.
[008] A recent approach directly models the missing signal in the time domain [10], with an algorithmic delay of 0 to 32 ms, using an architecture similar to WaveNet [11].
[009] When speech is transmitted for telecommunication, its frequency range is generally limited, for example by band limitation and downsampling. If this band limitation removes too much bandwidth from the signal, the perceived quality of the speech is significantly reduced. One way to overcome this would be to change the codec so that more bandwidth is transmitted. This often involves changing the entire network infrastructure, which is very costly and can take several years.
[010] Another way is to extend the frequency range artificially by a bandwidth extension. If the bandwidth extension is blind, no side information is transmitted from the encoder to the decoder, and no changes have to be made to the transmission infrastructure.
[011] It is the object of the present invention to provide an improved concept for generating a bandwidth-enhanced audio signal.
[012] This object is achieved by an apparatus for generating a bandwidth-enhanced audio signal according to claim 1, a system for processing an audio signal according to claim 26 or claim 27, a method of generating a bandwidth-enhanced audio signal according to claim 29, a method of processing an audio signal according to claim 30 or claim 31, or a computer program according to claim 32.
[013] The present invention is based on the finding that a neural network can advantageously be used to generate a bandwidth-enhanced audio signal. However, the neural network processor that implements the neural network is not used to generate the full enhancement frequency range, that is, the individual spectral lines in the enhancement frequency range. Instead, the neural network processor receives the input audio signal frequency range as an input and outputs a parametric representation for the enhancement frequency range. This parametric representation is used to perform raw signal processing of a raw signal that was generated by a separate raw signal generator. The raw signal generator can be any kind of signal synthesizer for the enhancement frequency range, such as a patcher as known from bandwidth extension, i.e., from spectral band replication procedures or intelligent gap filling. The patched signal can then be spectrally whitened or, alternatively, the signal can be spectrally whitened before being patched. This raw signal, i.e., a patched and spectrally whitened signal, is then further processed by the raw signal processor using the parametric representation supplied by the neural network in order to obtain the processed raw signal that has frequency components in the enhancement frequency range. The enhancement frequency range is a high band when a straight bandwidth extension is applied, where the input audio signal is a narrowband or low-band signal. Alternatively, the enhancement frequency range refers to certain spectral holes between a maximum frequency and a certain minimum frequency that are filled by intelligent gap filling procedures.
[014] Alternatively, the raw signal generator can also be implemented to generate a signal in the enhancement frequency range using any kind of non-linear processing, noise processing or noise generation.
[015] Since the neural network is only used to provide a parametric representation of the high band rather than the complete high-band signal or the full enhancement frequency range, the neural network can be less complex and therefore more efficient compared to other procedures in which a neural network is used to generate the full high-band signal. On the other hand, the neural network is fed with the narrowband signal itself, and therefore an additional feature extraction from the narrowband signal, as known from neural-network-controlled bandwidth extension procedures, is not required. In addition, it has been found that the generation of the raw signal for the enhancement frequency range can be carried out directly and, therefore, very efficiently without neural network processing, and the subsequent scaling of this raw signal or, in general, the subsequent raw signal processing can also be performed without any specific neural network support. Instead, neural network support is only needed to generate the parametric representation for the signal in the enhancement frequency range, and an ideal compromise is therefore found between conventional signal processing on the one hand, i.e., generating the raw signal for the enhancement frequency range and shaping or processing that raw signal, and, on the other hand, the non-conventional neural network processing that, in the end, generates the parametric representation used by the raw signal processor.
[016] This distribution between conventional processing and neural network processing provides an ideal compromise in terms of audio quality and neural network complexity, with respect both to neural network training and to the application of the neural network that has to be performed in any bandwidth-enhancing processor.
[017] Preferred embodiments are based on different time resolutions, that is, a very low time resolution and, preferably, a very high frequency resolution to generate the whitened raw signal. On the other hand, the neural network processor and the raw signal processor operate with a high time resolution and, therefore, preferably a low frequency resolution. However, configurations are also possible where the low time resolution is accompanied by a low frequency resolution, or the high time resolution by a high frequency resolution.
[018] A further preferred aspect of the present invention is based on a particularly useful whitening procedure that divides the originally generated raw signal by its spectral envelope, where the spectral envelope is generated by low-pass filtering the power spectrum with a very simple FIR low-pass filter, such as a three-, four- or five-tap low-pass filter where all taps are set to 1. This procedure serves two purposes. The first is that the formant structure is removed from the original raw signal, and the second is that the ratio of harmonic energy to noise energy is decreased. In this way, such a whitened signal will sound much more natural than, for example, an LPC residual signal, and such a signal is particularly suitable for parametric processing using the parametric representation generated by the neural network processor.
[019] A further aspect of the present invention is based on the advantageous embodiment in which the neural network processor is fed not with the amplitude spectrum but with the power spectrum of the input audio signal. Furthermore, in this embodiment, the neural network processor outputs a parametric representation, for example spectral envelope parameters, in a compressed domain such as a log domain, a square root domain or a ()^(1/3) domain. The training of the neural network processor is then more closely related to human perception, since human perception operates in a compressed domain rather than in a linear domain. On the other hand, the parameters generated in this way are converted into the linear domain by the raw signal processor so that, in the end, a linearly processed spectral representation of the signal in the enhancement frequency range is obtained, although the neural network operates on a power spectrum or even a loudness spectrum (amplitudes raised to the power of 3) and the parametric representation, or at least part of it, is output in the compressed domain, such as a log domain or a ()^(1/3) domain.
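By way of illustration, the round trip between the compressed parameter domain used by the neural network and the linear domain needed by the raw signal processor can be sketched as follows (an illustrative sketch only; the helper names `compress` and `expand` are hypothetical and do not appear in the patent):

```python
import math

def compress(energy, mode="log"):
    # Map a linear band energy into a perceptually motivated
    # compressed domain: log, square root, or ()^(1/3).
    if mode == "log":
        return math.log10(energy)
    if mode == "sqrt":
        return energy ** 0.5
    return energy ** (1.0 / 3.0)  # cube-root domain

def expand(value, mode="log"):
    # Inverse mapping, as performed by the raw signal processor,
    # which needs linear-domain values to scale the raw signal.
    if mode == "log":
        return 10.0 ** value
    if mode == "sqrt":
        return value ** 2
    return value ** 3
```

The network is trained on, and outputs, the compressed values, while the raw signal processor expands them back to the linear domain before scaling the raw signal.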
[020] A further advantageous aspect of the present invention relates to the implementation of the neural network itself. In one embodiment, the input layer of the neural network receives a two-dimensional time/frequency representation of the amplitude spectrum or, preferably, of the power or loudness spectrum. Thus, the input layer of the neural network is a two-dimensional layer covering the full frequency range of the input audio signal and, in addition, a certain number of preceding frames. This input layer is preferably implemented as a convolutional layer that has one or more convolutional kernels which are, however, quite small kernels that convolve, for example, only five or fewer frequency bins of only five or fewer time frames. This convolutional input layer is preferably followed by a further convolutional layer or a dilated convolutional layer that may or may not be enhanced by residual connections. In one embodiment, the output layer of the neural network, which emits the parameters of the parametric representation as, for example, values in a given range of values, can be a convolutional layer or a fully connected layer following a convolutional layer, so that no recurrent layers are used in the neural network. Such neural networks are, for example, described in "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling" by S. Bai et al., March 4, 2018, arXiv:1803.01271v1 [cs.LG]. The networks described in this publication are not based on recurrent layers at all, but only on certain convolutional layers.
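The effect of one such small kernel on a two-dimensional time/frequency input can be illustrated with a plain-Python valid convolution (a toy illustration assuming a single kernel and no padding; a real implementation would use a deep learning framework):

```python
def conv2d_valid(spectrogram, kernel):
    """Valid 2D convolution (no padding) of a time/frequency input
    with one small kernel, e.g. 3x3, matching the 'five or fewer bins
    of five or fewer frames' constraint described above.
    spectrogram[t][f] holds one spectral value per time frame t and
    frequency bin f."""
    kt, kf = len(kernel), len(kernel[0])
    out = []
    for t in range(len(spectrogram) - kt + 1):
        row = []
        for f in range(len(spectrogram[0]) - kf + 1):
            acc = 0.0
            for i in range(kt):
                for j in range(kf):
                    acc += kernel[i][j] * spectrogram[t + i][f + j]
            row.append(acc)
        out.append(row)
    return out
```

With a 4x4 input and a 3x3 kernel, the output shrinks to 2x2, which shows how stacking such layers gradually reduces the two-dimensional input toward the one-dimensional parameter output discussed below.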
[021] However, in a further embodiment, recurrent layers such as LSTM layers (or GRU layers) are used in addition to one or more convolutional layers. The last layer, i.e., the output layer of the network, may or may not be a fully connected layer with a linear output function. This linear output function allows the network to output unbounded continuous values. However, such a fully connected layer is not strictly necessary, since the reduction from the (large) two-dimensional input layer to the one-dimensional output parameter vector per time index can also be accomplished by adapting two or more upper convolutional layers, or by adapting two or more recurrent layers when LSTM or GRU layers are used.
[022] Additional aspects of the present invention relate to specific applications of the inventive bandwidth enhancement apparatus, such as a blind bandwidth extension used only for concealment, that is, when a frame loss has occurred. Here, the audio codec can have a non-blind bandwidth extension or no bandwidth extension at all, and the inventive concept provides a part of the missing signal due to a frame loss, or provides the complete missing signal.
[023] Alternatively, the inventive processing that uses a neural network processor is not only used as a fully blind bandwidth extension, but is used as part of a non-blind bandwidth extension or intelligent gap filling, where the parametric representation generated by the neural network processor is used as a first approximation that is refined, for example in the parameter domain, by some kind of quantized data controlled by a very small number of bits transmitted as additional side information, such as a single bit per selected parameter, e.g., the spectral envelope parameters. In this way, an extremely low bit rate guided extension is obtained which, however, relies on neural network processing within the encoder to generate the additional low bit rate side information and, at the same time, operates in the decoder to provide the parametric representation from the input audio signal, which is then refined by the additional very low bit rate side information.
[024] Additional embodiments provide a blind bandwidth extension (BBWE) that expands the bandwidth of telephone speech, which is often limited to 0.2 to 3.4 kHz. The advantage is an increase in perceived quality as well as in intelligibility. One embodiment is similar to state-of-the-art non-blind bandwidth enhancements such as intelligent gap filling, bandwidth extension or spectral band replication, with the difference that all processing is performed in the decoder without the need to transmit extra bits. Parameters such as spectral envelope parameters are estimated by a regressive convolutional deep neural network (CNN) with long short-term memory (LSTM). In one embodiment, the procedure operates on 20 ms frames without additional algorithmic delay and can be applied to state-of-the-art speech and audio codecs. These embodiments exploit the ability of convolutional and recurrent networks to model the spectral envelope of speech signals.
[025] Preferred embodiments of the present invention are subsequently discussed in relation to the accompanying drawings, in which:
Figure 1 is a block diagram of an apparatus for generating a bandwidth-enhanced audio signal from an input audio signal;
Figure 2a illustrates a preferred procedure of the raw signal generator of Figure 1;
Figure 2b is a preferred implementation of the apparatus of Figure 1, in which different time resolutions are applied to the raw signal generator on the one hand and to the neural network processor and the raw signal processor on the other hand;
Figure 2c is a preferred implementation for performing a spectral whitening operation inside the raw signal generator using a low-pass filter across frequency;
Figure 2d is a sketch illustrating the spectral situation of a preferred two-fold copy-up operation;
Figure 2e illustrates spectral vectors used for the purpose of raw signal generation and for the purpose of raw signal processing using the parametric representation output by the neural network processor;
Figure 3 is a preferred implementation of the raw signal generator;
Figure 4 is a preferred implementation of the apparatus for generating a bandwidth-enhanced audio signal in accordance with the present invention;
Figure 5 is a preferred embodiment of the neural network processor;
Figure 6 is a preferred embodiment of the raw signal processor;
Figure 7 is a preferred arrangement of the neural network;
Figure 8a is a sketch that compares the performance of different DNN configurations;
Figure 8b is an illustration showing the training set error and the test set error depending on the amount of data;
Figure 8c illustrates ACR listening test results displayed as MOS values;
Figure 9a illustrates the principle of a convolutional layer;
[026] Figure 1 illustrates a preferred embodiment of an apparatus for generating a bandwidth-enhanced audio signal from an input audio signal 50 that has an input audio signal frequency range. The frequency range of the input audio signal can be a low-band range, or a full-band range with smaller or larger spectral holes.
[027] The apparatus comprises a raw signal generator 10 for generating a raw signal 60 which has an enhancement frequency range, where the enhancement frequency range is not included in the input audio signal frequency range. The apparatus further comprises a neural network processor 30 configured to generate a parametric representation 70 for the enhancement frequency range using the input audio signal frequency range of the input audio signal and using a trained neural network. The apparatus further comprises a raw signal processor 20 for processing the raw signal 60 using the parametric representation 70 for the enhancement frequency range to obtain a processed raw signal 80 that has frequency components in the enhancement frequency range. In addition, the apparatus comprises, in a given implementation, an optional combiner 40 that outputs the bandwidth-enhanced audio signal, such as a signal with a low band and a high band, or a full-band signal without spectral holes or with fewer spectral holes than before, i.e., compared to the input audio signal 50.
[028] The processed raw signal 80 can already be, depending on the processing in the raw signal processor, the bandwidth-extended signal, when the combination of the processed raw signal and the input audio signal frequency range is, for example, performed within a frequency-to-time conversion as, for example, discussed in relation to Figure 4. In that case, the combination is already performed by this frequency-to-time converter, and the combiner 40 in Figure 1 is part of that frequency-to-time converter. Alternatively, the processed raw signal can be a time domain enhancement signal which is combined with the time domain input audio signal by a separate combiner, which would then perform a sample-wise addition of the two time domain signals. Other procedures for combining an enhancement signal and the original input signal are known to those skilled in the art.
[029] In addition, it is preferred that the raw signal generator uses the input audio signal to generate the raw signal, as illustrated by the dashed line 50 leading to the raw signal generator 10. Procedures that operate using the input audio signal are patching operations such as copy-up operations, harmonic patching operations, mixtures of copy-up and harmonic patching operations, or other patching operations that, at the same time, mirror the spectrum.
[030] Alternatively, the raw signal generator can operate without reference to the input audio signal. Then, the raw signal generated by the raw signal generator 10 can be a noise-like signal, and the raw signal generator would comprise some kind of noise generator or some kind of random function that generates noise. Alternatively, the input audio signal 50 could be used and could be processed by some kind of non-linearity in the time domain, such as sgn(x) times x², where sgn() is the sign of x. Other non-linear processing options would be clipping procedures or other time-domain procedures. A further option is a preferred frequency domain procedure that generates a frequency-shifted version of the band-limited input signal, such as a copy-up, a mirroring in the spectral domain or the like. However, mirroring in the spectral domain could also be performed by time domain processing operations in which zeros are inserted between samples: when, for example, one zero is inserted between two samples, a mirroring of the spectrum is obtained. When two zeros are inserted between two samples, this would constitute a non-mirrored copy operation to an upper spectral range, etc. In this way, it becomes clear that the raw signal generator can operate in the time domain or in the spectral domain in order to generate a raw signal within the enhancement frequency range, which is preferably a whitened signal as shown in relation to Figure 2a. However, this whitening does not necessarily have to be carried out in the spectral domain; it could also be carried out in the time domain, for example by LPC filtering, and then the LPC residual signal would be a whitened time domain signal. However, as will be highlighted later, a certain spectral domain whitening operation is preferred for the purposes of the present invention.
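The two time-domain alternatives mentioned above, a sign-preserving non-linearity and zero insertion between samples, can be sketched as follows (a simplified illustration; demonstrating the resulting spectral images would additionally require a DFT):

```python
def nonlinear_excite(samples):
    # sgn(x) * x^2: widens the bandwidth while preserving the sign
    # of each sample.
    return [x * x if x >= 0 else -(x * x) for x in samples]

def zero_insert(samples, zeros=1):
    # Inserting one zero between samples mirrors the spectrum around
    # the old Nyquist frequency when the result is reinterpreted at
    # the higher sampling rate; two zeros yield a non-mirrored copy
    # at a higher spectral position, etc.
    out = []
    for x in samples:
        out.append(x)
        out.extend([0.0] * zeros)
    return out
```

Both operations produce energy in the enhancement frequency range; the result would then still be whitened as described in relation to Figure 2a.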
[031] In a preferred implementation, the neural network processor receives, as an input, the audio signal or, in particular, a sequence of frames of spectral values of the audio signal, where the spectral values are either amplitude values or, more preferably, power values, i.e., amplitude spectral values raised to a given power, where the power is, for example, 2 (power domain) or 3 (loudness domain), but generally a power between 1.5 and 4.5 can be used to process the spectral values before feeding them into the neural network. This is, for example, illustrated in Figure 5 at item 32, which illustrates a power spectrum converter that converts a sequence of low-band spectral frames into a time sequence of spectral frames, and this time sequence of spectral frames, whether linear amplitudes, power amplitudes or loudness amplitudes, is then input into the trained neural network 31, which outputs parametric data, preferably in the compressed domain. Such parametric data can be any parametric data that describes the missing signal or the bandwidth enhancement signal, such as tonality parameters, temporal envelope parameters, spectral envelope parameters such as scale factor band energies, quantizer distribution values, energy values or tilt values. Other parameters, known for example from spectral band replication processing, are inverse filtering parameters, noise addition parameters or missing harmonics parameters, which can also be used in addition to spectral envelope parameters. Preferred spectral envelope parameters, or a kind of "baseline" parametric representation, are spectral envelope parameters and, preferably, absolute energies or powers for a number of bands. In the context of a true bandwidth extension where the input audio signal is only a narrowband signal, the enhancement range could, for example, have only four or five bands or, at most, ten enhancement bands, and then the parametric representation would only consist of a single energy-, power- or amplitude-related value per band, i.e., ten parameters for ten exemplary bands.
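A minimal sketch of such a per-band envelope representation is given below (the band edges are illustrative assumptions; the patent does not fix specific bin ranges at this point):

```python
def band_energies(spectrum, band_edges):
    """Compute one energy parameter per enhancement band, e.g. for
    band_edges = [0, 8, 16, 24, 32], yielding four band energies over
    the enhancement-range bins of one spectral frame."""
    energies = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        energies.append(sum(x * x for x in spectrum[lo:hi]))
    return energies
```

With four to ten such bands, the whole parametric representation reduces to at most ten values per frame, which is what keeps the neural network small.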
[032] In one embodiment, the bandwidth extension can be used as an extension of any kind of speech and audio codec, such as 3GPP Enhanced Voice Services (EVS) or MPEG AAC. The input to the bandwidth extension processing illustrated in Figure 1 is the decoded and resampled band-limited audio signal. The output is an estimate of the missing signal. The estimate could be the signal as a waveform or the coefficients of a transform, such as an FFT or a modified discrete cosine transform (MDCT) or similar. The parameters generated by the neural network processor 30 are the parameters of the parametric representation 70 discussed by way of example above.
[033] The missing signal is described by a few coarse parameters; an artificial signal is generated and is then modified by the parameters estimated by the neural network processor 30.
[034] Figure 2a illustrates a preferred procedure performed by the raw signal generator 10. In a step 11a, the raw signal generator generates a signal with a first tonality, and in a further step 11b, the raw signal generator spectrally whitens the signal with the first tonality to obtain a signal with a second, lower tonality. In other words, the tonality of the second signal is less than the tonality of the first signal, and/or the signal obtained by step 11b is whiter than the signal generated by step 11a.
[035] Furthermore, Figure 2b illustrates a particularly preferred implementation of the cooperation between the raw signal generator 10 on the one hand and the neural network processor 30 and the raw signal processor 20 on the other hand. As highlighted at 12, the raw signal generator generates a raw signal with a first (low) time resolution, and as highlighted at 32, the neural network processor 30 generates parametric data with a second (high) time resolution; the raw signal processor 20 then, at 22, scales or processes the raw signal with the second or high time resolution according to the time resolution of the parametric representation. Preferably, the time resolution in blocks 32 and 22 is the same but, alternatively, these blocks could also operate at different time resolutions, provided that the time resolution of block 32 is greater than the time resolution of the spectral whitening used in step 12, and provided that the time resolution used to scale/process the raw signal is greater than the time resolution of the raw signal generation illustrated at block 12 in Figure 2b. Thus, there are, in general, two alternatives: either the raw signal is generated with a low time resolution and the processing and the neural network operate with a high time resolution, or the raw signal is generated with a high frequency resolution and the processing and the neural network operate with a low frequency resolution.
[036] Figure 2d illustrates the situation of the spectra in an implementation where the input signal is a narrowband input signal, for example between 200 Hz and 3.4 kHz, and the bandwidth enhancement operation is a true bandwidth extension. Here, the input audio signal is input into a time-to-frequency converter 17 illustrated in Figure 3. Then, a patching operation is performed by a patcher 18 and, subsequent to patching, a whitening step 11b is performed, and the result is then converted into the time domain by a frequency-to-time converter 19. The output of block 19 of Figure 3 can be just a raw time domain signal, or a raw time domain signal together with the input audio signal. In addition, it should be noted that the order of operations between the whitener 11b and the patcher 18 can be exchanged, that is, the whitener can operate on the signal output by the time-to-frequency converter, i.e., the narrowband input audio signal, and subsequently the whitened signal is patched either once or, as shown in Figure 2d, twice, i.e., a first copy-up and a second copy-up, so that the full enhancement frequency range consists of the frequency range of the first copy operation and of the second copy operation. Of course, the patcher 18 in Figure 3 does not necessarily have to perform a copy operation, but could also perform a spectral spreading operation or any other operation to generate a signal in the enhancement frequency range that is whitened before or after its generation.
[037] In a preferred embodiment, the spectral whitening operation illustrated at 11b in Figure 2b or at 11b in Figure 3 comprises the procedures illustrated in Figure 2c. A linear spectral frame, such as generated by the time-to-frequency converter 17 of Figure 3, which can be an FFT processor, an MDCT processor or any other processor for converting a time domain representation into a spectral representation, is input into a linear-to-power converter 13. The output of the linear-to-power converter 13 is a power spectrum. Block 13 can apply any power operation, such as a power of 2 or 3 or, generally, a value between 1.5 and 4.5, although a value of 2 is preferred, in order to obtain a power spectrum at the output of block 13. A power frame is then low-pass filtered across frequency by the low-pass filter 14 to obtain the spectral power envelope estimate.
[038] Then, in block 15, the spectral power envelope estimate is converted back to the linear domain using a power-to-linear converter 15, and the linear spectral envelope estimate is then input into a whitening calculator 16 that also receives the linear spectral frame, in order to output the whitened spectral frame that corresponds, in a preferred implementation, to the raw signal or to a spectral frame of the raw signal. In particular, the linear spectral envelope estimate provides a certain linear factor for each spectral value of the linear spectral frame and, therefore, each spectral value of the linear spectral frame is divided by its corresponding weighting factor included in the linear spectral envelope estimate output by block 15.
[039] Preferably, the low-pass filter 14 is an FIR filter that has, for example, only 3, 4 or 5 taps or, at most, 8 taps, where preferably at least 3 taps have the same value and are preferably equal to 1, or even all 5 or, in general, all filter taps are equal to 1, in order to obtain a low-pass filtering operation.
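Under the assumptions above (a power of 2 and a 5-tap FIR low-pass with all taps equal to 1), the whitening chain of Figure 2c can be sketched in a few lines (a simplified sketch, not the claimed implementation; edge bins simply use the available neighbours):

```python
import math

def whiten_spectrum(frame, taps=5):
    """Spectral whitening along the lines of Fig. 2c: divide each
    linear spectral value by an envelope estimated from the
    low-pass-filtered power spectrum."""
    # Block 13: linear -> power domain (power of 2 preferred)
    power = [x * x for x in frame]
    # Block 14: FIR low-pass across frequency with all taps set to 1
    # (a moving sum; at the edges only the available bins contribute)
    half = taps // 2
    envelope_pow = []
    for i in range(len(power)):
        lo, hi = max(0, i - half), min(len(power), i + half + 1)
        envelope_pow.append(sum(power[lo:hi]))
    # Block 15: power -> linear domain
    envelope_lin = [math.sqrt(e) for e in envelope_pow]
    # Block 16: whitening = divide the spectrum by its envelope
    return [x / e if e > 0 else 0.0 for x, e in zip(frame, envelope_lin)]
```

For a flat magnitude frame, all interior bins are scaled by the same factor, so the whitened frame stays flat, while a strong formant peak is attenuated relative to its neighbourhood.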
[040] [040] Figure 2e illustrates processing performed in the context of the operation of the system in Figure 4.
[041] [041] A basic acoustic model of the human speech production process combines an excitation signal similar to a periodic pulse train (the larynx signal) modulated by a transfer filter determined by the shape of the supra-laryngeal vocal tract. In addition, there are noise-like signals that result from turbulent airflow caused by constrictions of the vocal tract or by the lips. Based on this model, the missing frequency range is generated by spectrally extending a flat excitation signal and then shaping it with an estimate of the vocal tract filter. Figure 1 depicts the proposed system. From the decoded time domain signal, blocks of 20 ms are transformed by a DFT to the frequency domain. The frame increment (hop size) of adjacent frames is 10 ms. In the frequency domain, the signal is upsampled to 16 kHz by zero padding, and the missing frequency content above 3.4 kHz is generated in the same way as in bandwidth extensions like Intelligent Gap Filling (IGF) or SBR [12, 13]: the lower bins are copied to create the missing signal. Since codecs like AMR-NB only encode frequencies between 200 and 3400 Hz, this signal is not sufficient to fill the missing range of 8000 − 3400 = 4600 Hz. Therefore, this operation has to be performed twice - the first time to fill the 3400 to 6600 Hz range and again to fill the 6600 to 8000 Hz range.
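The zero-padding upsampling step above can be sketched as follows; the function name and the amplitude convention are assumptions of this sketch. Extending the half spectrum with zero bins before the inverse transform doubles the sampling rate without altering the existing content.

```python
import numpy as np

def upsample_zero_pad(x, factor=2):
    """Raise the sampling rate of a block by zero padding its DFT.
    np.fft.irfft pads the half spectrum with zeros up to n//2 + 1 bins;
    multiplying by `factor` compensates the 1/n normalization so that
    amplitudes are preserved (illustrative sketch)."""
    X = np.fft.rfft(x)
    return np.fft.irfft(X, n=factor * len(x)) * factor
```

For a pure cosine block, every second sample of the upsampled signal reproduces the original block exactly, confirming that only zeros were added above the old Nyquist frequency.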
[042] [042] This artificially generated signal is too tonal compared to the original excitation signal. A low-complexity method used in IGF is applied to reduce the tonality [14]. The idea here is to divide the signal by its spectral envelope, generated by filtering the power spectrum with an FIR filter. This serves two purposes - first, the formant structure of the copied signal is removed (this could also be achieved with the use of the LPC residual); second, the ratio of harmonic energy to noise energy is decreased. Therefore, this signal will sound much more natural.
[043] [043] After an inverse DFT of twice the size of the initial DFT, the time domain signal with 16 kHz sampling frequency is generated by overlap-add with 50% overlap. This time domain signal, with a flat excitation signal above 3400 Hz, will now be shaped to resemble the formant structure of the original signal. This is done in the frequency domain of a DFT with higher time resolution operating on 10 ms blocks. Here the signal in the range from 3400 to 8000 Hz is divided into 5 bands of approximately 1 Bark width [15] and each DFT bin Xi within band b is scaled by a scaling factor fb:
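Since the scale-factor equation itself is not reproduced in this excerpt, the following sketch assumes a simple energy-matching factor per band; the function name, band layout, and the exact formula are illustrative assumptions, not the patent's definition.

```python
import numpy as np

def apply_band_gains(spec, bands, target_energy):
    """Scale the bins of each band so its energy matches the envelope
    parameter estimated for that band. Every bin X_i within band b is
    multiplied by the same factor f_b (sketch; formula assumed)."""
    out = spec.astype(float).copy()
    for (lo, hi), e_target in zip(bands, target_energy):
        e_raw = np.sum(np.abs(spec[lo:hi]) ** 2) + 1e-12  # current band energy
        out[lo:hi] *= np.sqrt(e_target / e_raw)           # shared factor f_b
    return out
```

For a flat 10-bin frame split into two 5-bin bands with target energies 20 and 5, the first band is scaled by 2 and the second is left unchanged.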
Claims (32)
[1]
1. Apparatus for generating a bandwidth-enhanced audio signal from an input audio signal (50) which has an input audio signal frequency range, characterized by comprising: a raw signal generator (10) configured to generate a raw signal (60) that has an intensification frequency range, wherein the intensification frequency range is not included in the input audio signal frequency range; a neural network processor (30) configured to generate a parametric representation (70) for the intensification frequency range using the input audio signal frequency range of the input audio signal and a trained neural network (31); and a raw signal processor (20) for processing the raw signal (60) using the parametric representation (70) for the intensification frequency range to obtain a processed raw signal (80) that has frequency components in the intensification frequency range, wherein the processed raw signal (80), or the processed raw signal and the input audio signal frequency range of the input audio signal, represent the bandwidth-enhanced audio signal.
[2]
Apparatus according to claim 1, characterized in that the raw signal generator (10) is configured to generate (11a) an initial raw signal that has a first tonality, and to spectrally whiten (11b) the initial raw signal to obtain the raw signal, the raw signal having a second tonality, the second tonality being lower than the first tonality.
[3]
Apparatus according to claim 1 or 2, characterized in that the raw signal generator (10) is configured to perform a spectral whitening of the initial raw signal using a first time resolution (12) or to generate the raw signal (60) using a first time resolution, or wherein the raw signal generator (10) is configured to perform a spectral whitening of the initial raw signal using a first frequency resolution (12) or to generate the raw signal (60) using a first frequency resolution, and wherein the neural network processor (30) is configured to generate (32) the parametric representation in a second time resolution, the second time resolution being greater than the first time resolution, or wherein the neural network processor (30) is configured to generate (32) the parametric representation in a second frequency resolution, the second frequency resolution being lower than the first frequency resolution, and wherein the raw signal processor (20) is configured to use (22) the parametric representation with the second time resolution or the second frequency resolution to process the raw signal in order to obtain the processed raw signal (80).
[4]
Apparatus according to any one of claims 1 to 3, characterized in that the raw signal generator (10) comprises a patch applicator (18) for patching a spectral portion of the input audio signal into the intensification frequency range, wherein the patching comprises a single patching operation or a multiple patching operation, wherein, in the multiple patching operation, a specific spectral portion of the input audio signal is patched to two or more spectral portions of the intensification frequency range.
[5]
Apparatus according to any one of claims 1 to 4, characterized in that the raw signal processor (20) comprises a time-to-frequency converter (17) for converting an input signal into a spectral representation, the spectral representation comprising a time sequence of spectral frames, a spectral frame having spectral values, wherein the neural network processor (30) is configured to feed the spectral frames into the trained neural network (31) or to process (32) the spectral frames to obtain processed spectral frames, wherein the spectral values are converted into a power domain using a power between 1.5 and 4.5, preferably a power of 2 or 3, and wherein the neural network (31) is configured to output the parametric representation in relation to the power domain, and wherein the raw signal processor (20) is configured to convert (26) the parametric representation into a linear domain and to apply (27) the linear-domain parametric representation to the time sequence of spectral frames.
[6]
Apparatus according to any one of claims 1 to 5, characterized in that the neural network processor (30) is configured to output the parametric representation (70) in a log representation or in a compressed representation obtained using a power lower than 0.9, and wherein the raw signal processor (20) is configured to convert (26) the parametric representation from the log representation or the compressed representation into a linear representation.
[7]
Apparatus according to any one of claims 1 to 6, characterized in that the raw signal generator (10) comprises: a time-to-frequency converter (17) for converting the input audio signal into a sequence of spectral frames, a spectral frame having a sequence of spectral values; a patch applicator (18) for generating a patched signal for each spectral frame using an output of the time-to-frequency converter (17); a whitening stage (11b) for spectrally whitening the patched signal for each spectral frame, or for whitening a corresponding signal from the time-to-frequency converter (17) before the patching operation is performed by the patch applicator; and a frequency-to-time converter (19) for converting a sequence of frames comprising patched and spectrally whitened frames into a time domain representation to obtain the raw signal (60), wherein the frequency-to-time converter is configured to accommodate the intensification frequency range.
[8]
Apparatus according to any one of claims 1 to 7, characterized in that a whitening stage (11b) within the raw signal generator comprises: a low-pass filter for low-pass filtering (14) a spectral frame or a power representation (13) of the spectral frame to obtain an envelope estimate for the spectral frame; and a calculator for calculating (16) a whitened signal by dividing the spectral frame by the envelope estimate, wherein, when the envelope estimate is derived from the power representation, the calculator calculates linear weighting factors for spectral values (15) and divides the spectral values by the linear weighting factors.
[9]
Apparatus according to any one of claims 1 to 8, characterized in that the raw signal processor (20) comprises a time-to-frequency converter (22) for converting the input audio signal, or a signal derived from the input audio signal, and the raw signal (60) into a spectral representation, wherein the neural network processor (30) is configured to receive a spectral representation of the input audio signal frequency range, wherein the raw signal processor (20) comprises a spectral processor (23) for applying the parametric representation (70), provided by the neural network processor (30) in response to the spectral representation of the input audio signal frequency range, to the spectral representation of the raw signal (60); and wherein the raw signal processor (20) further comprises a frequency-to-time converter (24) for converting a processed spectral representation of the raw signal into the time domain,
wherein the apparatus is configured to combine the processed raw signal and the input audio signal frequency range by feeding the processed spectral representation and the spectral representation of the input audio signal frequency range to the frequency-to-time converter (24), or by combining a time representation of the input audio signal frequency range and a time representation of the processed raw signal (80) in the time domain.
[10]
Apparatus according to any one of claims 1 to 9, characterized in that the neural network processor (30) comprises a neural network (31) with an input layer (32) and an output layer (34), wherein the neural network processor is configured to receive, at the input layer, a spectrogram derived from the input audio signal, the spectrogram comprising a time sequence of spectral frames, a spectral frame having a number of spectral values, and to output, at the output layer (34), individual parameters of the parametric representation (70), wherein the spectral values are linear spectral values, or power spectral values processed using a power between 1.5 and 4.5, or processed power values, the processing comprising a compression using a log function or a power function with a power lower than 1.
[11]
Apparatus according to claim 10, characterized in that the input layer (32) or one or more intermediate layers (33) is formed as a convolutional layer comprising one or more convolution kernels, wherein a convolution kernel is configured to perform a convolution processing of a number of spectral values from at least two different frames in the time sequence of spectral frames.
[12]
12. Apparatus according to claim 11, characterized in that the convolution kernel is configured to perform a two-dimensional convolution processing that involves a first number of spectral values per frame and a second number of frames in the time sequence of frames, wherein the first number and the second number are at least two and less than ten.
[13]
Apparatus according to claim 11 or 12, characterized in that the input layer (32) or the first intermediate layer (33) comprises at least one kernel that processes spectral values that are adjacent in frequency and adjacent in time, and wherein the neural network (31) additionally comprises an intermediate convolutional layer (33b) that operates based on a dilation factor so that, with respect to a time index, only every second or every third result of a preceding layer in a stack of layers is received by the convolutional layer as input.
[14]
Apparatus according to any one of claims 10 to 13, characterized in that the neural network comprises, as the output layer (34) or in addition to the output layer (34), a recurrent layer, wherein the recurrent layer receives an output vector from a convolutional layer for a time index and outputs an output vector using a recurrent layer function that has a memory.
[15]
Apparatus according to claim 14, characterized in that the recurrent layer comprises a long short-term memory (LSTM) function, or comprises a gated recurrent unit (GRU) function, or is an IIR filter function.
[16]
Apparatus according to any one of claims 10 to 15, characterized in that the input layer (32) or one or more intermediate layers (33) is configured to calculate, for each input, an output using a convolution function of a convolutional layer, wherein the convolutional layer comprises a residual connection, so that at least one group of outputs is a linear combination of the output of the convolution function and the input to the convolution function.
[17]
Apparatus according to any one of claims 10 to 16, characterized in that the output layer comprises one or more fully connected layers, wherein the fully connected layer or an uppermost fully connected layer provides, at an output thereof, parameters of the parametric representation for a current time frame of the raw signal, and wherein a fully connected layer is configured to receive, at an input thereof, output values from an input layer or an intermediate layer for the current time frame.
[18]
Apparatus according to any one of claims 10 to 17, characterized in that the input layer (32) or an intermediate layer (33) is a convolutional layer that has an output data vector for each integer time index, wherein the neural network (31) additionally comprises an additional convolutional layer that has one or more kernels for a dilated convolution process, wherein the one or more kernels of the additional convolutional layer receive at least two data vectors from the input layer or intermediate layer for time indices that differ from each other by more than an integer value in order to calculate an output vector for a time index, and wherein, to calculate an output vector for a next time index, the one or more kernels receive at least two data vectors from the input layer or the intermediate layer for further time indices that are interleaved with the time indices.
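A convolution whose kernel taps skip intermediate time indices, as claimed above, can be sketched as follows; this is an illustrative sketch with assumed names, not the claimed implementation.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=2):
    """1-D convolution where the kernel taps are spaced `dilation`
    indices apart, so only every second (or third) output of the
    preceding layer contributes to one output value."""
    k = len(kernel)
    span = (k - 1) * dilation            # receptive field minus one
    out = np.empty(len(x) - span)
    for t in range(len(out)):
        out[t] = sum(kernel[j] * x[t + j * dilation] for j in range(k))
    return out
```

With a two-tap kernel and dilation 2, each output combines inputs two indices apart, widening the receptive field without adding parameters.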
[19]
Apparatus according to any one of claims 10 to 18, characterized in that the neural network comprises: a first convolutional layer as the input layer for receiving a current frame comprising the input audio signal frequency range of the input audio signal corresponding to a current time index, wherein the first convolutional layer is configured to additionally use one or more previous frames; at least one second convolutional layer for receiving an output of the first convolutional layer, wherein the at least one second convolutional layer is configured to perform a dilated convolution operation to obtain a vector for a current time index; at least one recurrent layer for processing the vector for the current time index using a recurrent function that implements a memory function covering at least five time indices preceding the current time index; wherein a recurrent layer forms the output layer (34), or wherein the output layer (34) is a fully connected layer that receives an output of a recurrent layer and outputs the parameters of the parametric representation (70).
[20]
Apparatus according to any one of claims 1 to 19, characterized in that the parametric representation (70) comprises a spectral envelope value for each band of a plurality of intensification frequency bands, wherein the plurality of intensification frequency bands together form the intensification frequency range, and wherein each intensification frequency band comprises at least two spectral values, and wherein the raw signal processor is configured to scale (27, 23) the at least two spectral values of the raw signal in an intensification frequency band using the spectral envelope value for that intensification frequency band.
[21]
21. Apparatus according to claim 20, characterized in that the spectral envelope value indicates a measure for an absolute energy of the intensification frequency band with which the spectral envelope value is associated, wherein the raw signal processor (20) is configured to calculate (25) a measure for a raw signal energy in the intensification frequency band, and wherein the raw signal processor (20) is configured to scale (27) the amplitude values using the measure for the absolute energy so that the scaled spectral values in the intensification frequency band have an energy as indicated by the measure for the absolute energy.
[22]
Apparatus according to claim 21, characterized in that the raw signal processor (20) is configured to calculate (27) a scaling factor from the measure for the raw signal energy in the intensification frequency band and the measure for the absolute energy of the intensification frequency band derived from the parametric representation (70).
[23]
23. Apparatus according to any one of claims 20 to 22, characterized in that the raw signal processor (20) is configured to calculate the scaled spectral values based on the following equation:
Similar technologies:
Publication number | Publication date | Patent title
BR112020008216A2|2020-10-27|apparatus and its method for generating an enhanced audio signal, system for processing an audio signal
US10249313B2|2019-04-02|Adaptive bandwidth extension and apparatus for the same
US10885926B2|2021-01-05|Classification between time-domain coding and frequency domain coding for high bit rates
JP6470857B2|2019-02-13|Unvoiced / voiced judgment for speech processing
Li et al.2018|Speech bandwidth extension using generative adversarial networks
BRPI0808202A2|2014-07-01|CODING DEVICE AND CODING METHOD.
Schmidt et al.2018|Blind bandwidth extension based on convolutional and recurrent deep neural networks
US10062390B2|2018-08-28|Decoder for generating a frequency enhanced audio signal, method of decoding, encoder for generating an encoded signal and method of encoding using compact selection side information
Skoglund et al.2019|Improving Opus low bit rate quality with neural speech synthesis
Sankar et al.2020|Design of MELPe-Based Variable-Bit-Rate Speech Coding with Mel Scale Approach Using Low-Order Linear Prediction Filter and Representing Excitation Signal Using Glottal Closure Instants
Sankar et al.2021|Mel Scale-Based Linear Prediction Approach to Reduce the Prediction Filter Order in CELP Paradigm
Nurminen2013|A Parametric Approach for Efficient Speech Storage, Flexible Synthesis and Voice Conversion
Johansen2009|Bandwidth Extension of Telephony Speech
Patent family:
Publication number | Publication date
RU2745298C1|2021-03-23|
WO2019081070A1|2019-05-02|
US20200243102A1|2020-07-30|
EP3701527A1|2020-09-02|
CN111386568A|2020-07-07|
JP2021502588A|2021-01-28|
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title

FR2807897B1|2000-04-18|2003-07-18|France Telecom|SPECTRAL ENRICHMENT METHOD AND DEVICE|
WO2003019534A1|2001-08-31|2003-03-06|Koninklijke Philips Electronics N.V.|Bandwidth extension of a sound signal|
DE102008015702B4|2008-01-31|2010-03-11|Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.|Apparatus and method for bandwidth expansion of an audio signal|
US8433582B2|2008-02-01|2013-04-30|Motorola Mobility Llc|Method and apparatus for estimating high-band energy in a bandwidth extension system|
JP5777041B2|2010-07-23|2015-09-09|沖電気工業株式会社|Band expansion device and program, and voice communication device|
WO2013098885A1|2011-12-27|2013-07-04|三菱電機株式会社|Audio signal restoration device and audio signal restoration method|
CN111477245A|2013-06-11|2020-07-31|弗朗霍弗应用研究促进协会|Speech signal decoding device and speech signal encoding device|
AU2014283285B2|2013-06-21|2017-09-21|Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.|Audio decoder having a bandwidth extension module with an energy adjusting module|
WO2017222356A1|2016-06-24|2017-12-28|Samsung Electronics Co., Ltd.|Signal processing method and device adaptive to noise environment and terminal device employing same|
US10432240B1|2018-05-22|2019-10-01|Micron Technology, Inc.|Wireless devices and systems including examples of compensating power amplifier noise|
CN110223680B|2019-05-21|2021-06-29|腾讯科技(深圳)有限公司|Voice processing method, voice recognition device, voice recognition system and electronic equipment|
US10763905B1|2019-06-07|2020-09-01|Micron Technology, Inc.|Wireless devices and systems including examples of mismatch correction scheme|
CN110265053A|2019-06-29|2019-09-20|联想有限公司|Signal de-noising control method, device and electronic equipment|
CN110322891B|2019-07-03|2021-12-10|南方科技大学|Voice signal processing method and device, terminal and storage medium|
US11005689B2|2019-07-11|2021-05-11|Wangsu Science & Technology Co., Ltd.|Method and apparatus for bandwidth filtering based on deep learning, server and storage medium|
CN110491407B|2019-08-15|2021-09-21|广州方硅信息技术有限公司|Voice noise reduction method and device, electronic equipment and storage medium|
WO2021088569A1|2019-11-05|2021-05-14|Guangdong Oppo Mobile Telecommunications Corp., Ltd.|Convolution method and device, electronic device|
US20210241776A1|2020-02-03|2021-08-05|Pindrop Security, Inc.|Cross-channel enrollment and authentication of voice biometrics|
US10972139B1|2020-04-15|2021-04-06|Micron Technology, Inc.|Wireless devices and systems including examples of compensating power amplifier noise with neural networks or recurrent neural networks|
CN111554309A|2020-05-15|2020-08-18|腾讯科技(深圳)有限公司|Voice processing method, device, equipment and storage medium|
WO2021255153A1|2020-06-19|2021-12-23|Rtx A/S|Low latency audio packet loss concealment|
CN113035211B|2021-03-11|2021-11-16|马上消费金融股份有限公司|Audio compression method, audio decompression method and device|
CN113423005A|2021-05-18|2021-09-21|电子科技大学|Motion-driven intelligent music generation method and system|
Legal status:
2021-12-07| B350| Update of information on the portal [chapter 15.35 patent gazette]|
Priority:
Application number | Filing date | Patent title
EP17198997|2017-10-27|
EP17198997.3|2017-10-27|
PCT/EP2018/059593|WO2019081070A1|2017-10-27|2018-04-13|Apparatus, method or computer program for generating a bandwidth-enhanced audio signal using a neural network processor|